A Framework for Batched and GPU-Resident Factorization Algorithms Applied to Block Householder Transformations
نویسندگان
چکیده
As modern hardware keeps evolving, an increasingly effective approach to develop energy efficient and high-performance solvers is to design them to work on many small size and independent problems. Many applications already need this functionality, especially for GPUs, which are currently known to be about four to five times more energy efficient than multicore CPUs. We describe the development of the main one-sided factorizations that work for a set of small dense matrices in parallel, and we illustrate our techniques on the QR factorization based on Householder transformations. We refer to this mode of operation as a batched factorization. Our approach is based on representing the algorithms as a sequence of batched BLAS routines for GPU-only execution. The hybrid CPU-GPU algorithms rely heavily on using the multicore CPU for specific part of the workload. But in order to benefit from the GPU’s significantly higher energy efficiency, the primary design goal is to avoid the use of the multicore CPU and to exclusively rely on the GPU. Additionally, this will result in the removal of the costly CPU-to-GPU communication. Furthermore, we do not use a single symmetric multiprocessor (on the GPU) to factorize a single problem at a time. We illustrate how our performance analysis and the use of profiling and tracing tools guided the development and optimization of batched factorization to achieve up to 2-fold speedup and 3-fold better energy efficiency compared to our highly optimized batched CPU implementations based on the MKL library (when using two sockets of Intel Sandy Bridge CPUs). Compared to a batched QR factorization featured in the CUBLAS library for GPUs, we achieved up to 5 speedup on the K40 GPU.
منابع مشابه
A Projected Alternating Least square Approach for Computation of Nonnegative Matrix Factorization
Nonnegative matrix factorization (NMF) is a common method in data mining that have been used in different applications as a dimension reduction, classification or clustering method. Methods in alternating least square (ALS) approach usually used to solve this non-convex minimization problem. At each step of ALS algorithms two convex least square problems should be solved, which causes high com...
متن کاملNORTH- HOLLAND High Performance Algorithms for Toeplitz and Block Toeplitz Matrices
In this paper, we present several high performance variants of the classical Schur algorithm to factor various Toeplitz matrices. For positive definite block Toeplitz matrices, we show how hyperbolic Householder transformations may be blocked to yield a block Schur algorithm. This algorithm uses BLAS3 primitives and makes efficient use of a memory hierarchy. We present three algorithms for inde...
متن کاملParallel Implementation of Particle Swarm Optimization Variants Using Graphics Processing Unit Platform
There are different variants of Particle Swarm Optimization (PSO) algorithm such as Adaptive Particle Swarm Optimization (APSO) and Particle Swarm Optimization with an Aging Leader and Challengers (ALC-PSO). These algorithms improve the performance of PSO in terms of finding the best solution and accelerating the convergence speed. However, these algorithms are computationally intensive. The go...
متن کاملOrthonormal integrators based on Householder and Givens transformations
We consider refined implementations of algorithms based on Householder and Givens transformations to find the Q-factor in the QR factorization of a matrix solution of linear time dependent differential systems. After discussing the algorithms, we introduce a suite of integrators, QRINT, and provide numerical testing to show the efficiency and accuracy of our techniques.
متن کاملA New Much Faster and Simpler Algorithm for Lapack Dgels
We present new algorithms for computing the linear least squares solution to overde-termined linear systems and the minimum norm solution to underdetermined linear systems. For both problems, we consider the standard formulation min kAX ? BkF and the transposed formulation min kA T X ? BkF , i.e, four diierent problems in all. The functionality of our implementation corresponds to that of the L...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2015